This paper presents the TransBoat, a novel omnidirectional unmanned surface vehicle (USV) with a magnet-based docking system for overwater construction under wave disturbances. This is the first such USV that can build overwater structures by transporting modules. The TransBoat incorporates two features designed to reject wave disturbances. First, the TransBoat's expandable body structure can actively transform from a mono-hull into a multi-hull for stabilization in turbulent environments by extending its four outrigger hulls. Second, a real-time nonlinear model predictive control (NMPC) scheme, based on a nonlinear dynamic model, is proposed for all shapes of the TransBoat to enhance its maneuverability and resist disturbances to its movement. An experimental approach is proposed to identify the parameters of the dynamic model, and a subsequent trajectory tracking test validates the dynamics, the NMPC controller and the system mobility. Further, docking experiments show improved performance in the expanded form of the TransBoat compared with the contracted form, including an increased success rate (by ~10%) and reduced docking time (by ~40 s on average). Finally, a bridge construction test verifies our system design and the NMPC control method.
translated by 谷歌翻译
Video dubbing aims to translate the original speech in a film or television program into speech in a target language, which can be achieved with a cascaded system consisting of speech recognition, machine translation and speech synthesis. To ensure that the translated speech is well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech, which requires strict length control. Previous works usually control the number of words or characters generated by the machine translation model to be similar to that of the source sentence, without considering the isochronicity of speech, since the speech duration of words/characters varies across languages. In this paper, we propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation, to match the length of source and target speech. Specifically, we control the speech length of the generated sentence by guiding the prediction of each word with duration information, including the speech duration of the word itself as well as how much duration is left for the remaining words. We design experiments on four language directions (German -> English, Spanish -> English, Chinese <-> English), and the results show that the proposed method achieves better length control ability on the generated speech than baseline methods. To make up for the lack of real-world datasets, we also construct a real-world test set collected from films to provide comprehensive evaluations on the video dubbing task.
In a mixed generalized linear model, the objective is to learn multiple signals from unlabeled observations: each sample comes from exactly one signal, but it is not known which one. We consider the prototypical problem of estimating two statistically independent signals in a mixed generalized linear model with Gaussian covariates. Spectral methods are a popular class of estimators which output the top two eigenvectors of a suitable data-dependent matrix. However, despite the wide applicability, their design is still obtained via heuristic considerations, and the number of samples $n$ needed to guarantee recovery is super-linear in the signal dimension $d$. In this paper, we develop exact asymptotics on spectral methods in the challenging proportional regime in which $n, d$ grow large and their ratio converges to a finite constant. By doing so, we are able to optimize the design of the spectral method, and combine it with a simple linear estimator, in order to minimize the estimation error. Our characterization exploits a mix of tools from random matrices, free probability and the theory of approximate message passing algorithms. Numerical simulations for mixed linear regression and phase retrieval display the advantage enabled by our analysis over existing designs of spectral methods.
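Expressive speech synthesis, such as audiobook synthesis, remains challenging for style representation learning and prediction. Deriving styles from reference audio or predicting style labels from text requires a large amount of labeled data, which is costly to acquire and difficult to define and annotate accurately. In this paper, we propose a novel framework that learns style representations from abundant plain text in a self-supervised manner. It leverages an emotion lexicon and uses contrastive learning and deep clustering. We further integrate the style representation as a conditioning embedding in a multi-style Transformer TTS. Compared with a multi-style TTS that predicts style labels and is trained on the same dataset but with human annotations, our method achieves improved results according to subjective evaluations on both in-domain and out-of-domain audiobook test sets. Moreover, with the implicit context-aware style representation, emotion transitions in long synthesized audio appear more natural. Audio samples are available on the demo web page.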
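We consider a high-dimensional mean estimation problem over a binary hidden Markov model, which illuminates the interplay between memory in data, sample size, dimension, and signal strength in statistical inference. In this model, an estimator observes $n$ samples of a $d$-dimensional parameter vector $\theta_{*} \in \mathbb{R}^{d}$, multiplied by a random sign $s_i$ ($1 \le i \le n$), and corrupted by isotropic standard Gaussian noise. The sequence of signs $\{s_{i}\}_{i \in [n]} \in \{-1,1\}^{n}$ is drawn from a stationary homogeneous Markov chain with flip probability $\delta \in [0,1/2]$. As $\delta$ varies, this model smoothly interpolates between two well-studied models: the Gaussian location model, for which $\delta = 0$, and the Gaussian mixture model, for which $\delta = 1/2$. Assuming that the estimator knows $\delta$, we establish a nearly minimax optimal (up to logarithmic factors) estimation error rate, as a function of $\|\theta_{*}\|, \delta, d, n$. We then provide an upper bound for the case of estimating $\delta$, assuming (possibly inaccurate) knowledge of $\theta_{*}$. The bound is proved to be tight when $\theta_{*}$ is an accurately known constant. These results are then combined into an algorithm that estimates $\theta_{*}$ with $\delta$ unknown a priori, and theoretical guarantees on its error are stated.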
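The observation model in this abstract, a fixed vector multiplied by a Markov sign sequence plus standard Gaussian noise, can be made concrete with a small sampler. The following is only a minimal sketch of the data-generating process, not the paper's estimator; the function names `sample_sign_chain` and `sample_observations` are hypothetical, and drawing the first sign uniformly is an assumption made so that the chain starts in a stationary distribution.

```python
import random


def sample_sign_chain(n, delta, rng):
    """Draw n signs in {-1, +1} from a homogeneous Markov chain.

    Each sign equals the previous one flipped with probability delta.
    The first sign is uniform (assumption), which is stationary for
    this symmetric chain.
    """
    signs = [rng.choice((-1, 1))]
    for _ in range(n - 1):
        flip = rng.random() < delta
        signs.append(-signs[-1] if flip else signs[-1])
    return signs


def sample_observations(theta, n, delta, seed=0):
    """Return (signs, xs) with xs[i] = signs[i] * theta + standard Gaussian noise."""
    rng = random.Random(seed)
    signs = sample_sign_chain(n, delta, rng)
    xs = [[s * t + rng.gauss(0.0, 1.0) for t in theta] for s in signs]
    return signs, xs


if __name__ == "__main__":
    # delta = 0: the sign never flips (Gaussian location model, up to a global sign).
    signs, xs = sample_observations([1.0, -2.0, 0.5], n=5, delta=0.0)
    print(signs)
    # delta = 1/2: signs are independent and uniform (Gaussian mixture model).
    signs, xs = sample_observations([1.0, -2.0, 0.5], n=5, delta=0.5)
    print(signs)
```

Setting `delta` between 0 and 1/2 yields sign sequences with longer or shorter runs, which is the "memory in data" that the abstract refers to.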
Learning efficient and interpretable policies has been a challenging task in reinforcement learning (RL), particularly in the visual RL setting with complex scenes. While neural networks have achieved competitive performance, the resulting policies are often over-parameterized black boxes that are difficult to interpret and deploy efficiently. More recent symbolic RL frameworks have shown that high-level domain-specific programming logic can be designed to handle both policy learning and symbolic planning. However, these approaches rely on coded primitives with little feature learning, and when applied to high-dimensional visual scenes, they can suffer from scalability issues and perform poorly when images have complex object interactions. To address these challenges, we propose \textit{Differentiable Symbolic Expression Search} (DiffSES), a novel symbolic learning approach that discovers discrete symbolic policies using partially differentiable optimization. By using object-level abstractions instead of raw pixel-level inputs, DiffSES is able to leverage the simplicity and scalability advantages of symbolic expressions, while also incorporating the strengths of neural networks for feature learning and optimization. Our experiments demonstrate that DiffSES is able to generate symbolic policies that are simpler and more scalable than state-of-the-art symbolic RL methods, with a reduced amount of symbolic prior knowledge.
Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference speed, which restricts their application in real-time systems. Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and two more challenging datasets with multiple speakers (LibriTTS) and high sampling rate (VCTK). Experimental results show that in comparison with other speed-up methods for DDPMs: 1) ResGrad achieves better sample quality at the same inference speed measured by real-time factor; 2) with similar speech quality, ResGrad synthesizes speech more than 10 times faster than baseline methods. Audio samples are available at https://resgrad1.github.io/.
Accurate and smooth global navigation satellite system (GNSS) positioning for pedestrians in urban canyons is still a challenge due to the multipath effects and the non-line-of-sight (NLOS) receptions caused by reflections from surrounding buildings. The recently developed factor graph optimization (FGO) based GNSS positioning method opened a new window for improving urban GNSS positioning by effectively exploiting the measurement redundancy from historical information to resist outlier measurements. Unfortunately, FGO-based GNSS standalone positioning is still challenged in highly urbanized areas. As an extension of the previous FGO-based GNSS positioning method, this paper exploits the potential of the pedestrian dead reckoning (PDR) model in FGO to improve the GNSS standalone positioning performance in urban canyons. Specifically, the relative motion of the pedestrian is estimated from the raw acceleration measurements of the onboard smartphone inertial measurement unit (IMU) via the PDR algorithm. Then the raw GNSS pseudorange, Doppler measurements, and relative motion from PDR are integrated using FGO. Given that pedestrian navigation involves small accelerations most of the time, a novel soft motion model is proposed to smooth the states involved in the factor graph model. The effectiveness of the proposed method is verified step-by-step on two datasets collected in dense urban canyons of Hong Kong using smartphone-level GNSS receivers. A comparison between the conventional extended Kalman filter, several existing methods, and the FGO-based integration is presented. The results reveal that the existing FGO-based GNSS standalone positioning is highly complementary to the PDR's relative motion estimation. Both improved positioning accuracy and improved trajectory smoothness are obtained with the proposed method.
Modern autonomous driving systems are characterized as modular tasks in sequential order, i.e., perception, prediction and planning. As sensors and hardware improve, there is a growing trend to devise a system that can perform a wide diversity of tasks to fulfill higher-level intelligence. Contemporary approaches resort to either deploying standalone models for individual tasks, or designing a multi-task paradigm with separate heads. These might suffer from accumulative error or negative transfer effects. Instead, we argue that a favorable algorithm framework should be devised and optimized in pursuit of the ultimate goal, i.e., planning of the self-driving car. Oriented at this goal, we revisit the key components within perception and prediction. We analyze each module and prioritize the tasks hierarchically, such that all these tasks contribute to planning (the goal). To this end, we introduce Unified Autonomous Driving (UniAD), the first comprehensive framework to date that incorporates full-stack driving tasks in one network. It is devised to leverage the advantages of each module and to provide complementary feature abstractions for agent interaction from a global perspective. Tasks communicate through a unified query design to facilitate each other toward planning. We instantiate UniAD on the challenging nuScenes benchmark. Extensive ablations show that this design philosophy surpasses previous state-of-the-art methods by a large margin in all aspects. The full suite of codebase and models will be made available to facilitate future research in the community.
Substantial theoretical and empirical evidence shows that flatter local minima tend to improve generalization. Adversarial Weight Perturbation (AWP) is an emerging technique to efficiently and effectively find such minima. In AWP we minimize the loss w.r.t. a bounded worst-case perturbation of the model parameters, thereby favoring local minima with a small loss in a neighborhood around them. The benefits of AWP, and more generally the connections between flatness and generalization, have been extensively studied for i.i.d. data such as images. In this paper, we study this phenomenon in depth for graph data. Along the way, we first derive a generalization bound for non-i.i.d. node classification tasks. Then we identify a vanishing-gradient issue with all existing formulations of AWP and we propose a new Weighted Truncated AWP (WT-AWP) to alleviate this issue. We show that regularizing graph neural networks with WT-AWP consistently improves both natural and robust generalization across many different graph learning tasks and models.
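The idea of minimizing the loss under a bounded worst-case weight perturbation can be illustrated on a toy problem. The sketch below is not WT-AWP and is not taken from the paper: it assumes a linear least-squares model, an L2 ball of radius `rho` for the perturbation, and a single normalized gradient-ascent step as a first-order approximation of the inner maximization. The names `loss_and_grad` and `awp_step` and the values of `rho` and `lr` are hypothetical.

```python
import math


def loss_and_grad(w, data):
    """Mean squared error of the linear model w.x and its gradient w.r.t. w."""
    n = len(data)
    loss = 0.0
    grad = [0.0] * len(w)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        loss += err * err / n
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / n
    return loss, grad


def awp_step(w, data, rho=0.05, lr=0.1):
    """One update using the gradient taken at adversarially perturbed weights.

    The perturbation is rho * g / ||g||, i.e. the first-order worst case
    inside an L2 ball of radius rho (an assumption of this sketch).
    """
    _, g = loss_and_grad(w, data)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        perturbation = [0.0] * len(w)
    else:
        perturbation = [rho * gi / norm for gi in g]
    w_adv = [wi + pi for wi, pi in zip(w, perturbation)]
    # Gradient at the perturbed point, applied to the unperturbed weights.
    _, g_adv = loss_and_grad(w_adv, data)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


if __name__ == "__main__":
    data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.0)]
    w = [0.0, 0.0]
    for _ in range(200):
        w = awp_step(w, data)
    print(w, loss_and_grad(w, data)[0])
```

Because the update uses the gradient evaluated at the perturbed weights, it penalizes points whose loss rises sharply nearby, which is the flatness-seeking behavior the abstract describes.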